Corpora of Non-Linguistic Symbol Systems

Authors

  • Katherine Wu
  • Jennifer Solman
  • Ruth Linehan
  • Richard Sproat
Abstract

Humans have been writing for over 5,000 years, but alongside linguistic symbol systems there have been many non-linguistic ones. Examples include mathematical symbology, European heraldry, barn stars, Mesopotamian deity symbols and totem poles. While writing represents natural language units such as phonemes, syllables, morphemes or in some cases words, non-linguistic systems represent other kinds of information. Thus, mathematical symbols represent mathematical operations, functions, variables and the like. Note that it does not matter that one can read a mathematical equation using words; the elements of the equation do not represent words, or any other linguistic elements. Within the past few years, two high-profile papers have claimed to provide statistical methods to distinguish writing from non-linguistic symbol systems. The first (Rao et al., 2009) used bigram conditional entropy to argue that the symbols used by the Indus Valley civilization constituted a writing system. The second (Lee et al., 2010) used a different technique, also based on conditional entropy, to argue that Pictish symbols, found on a few hundred standing stones in Scotland, were part of a heretofore unrecognized writing system. Both of these papers were very favorably reported in the popular science press. The problem is that the techniques reported in the cited papers do not provide evidence that a system is linguistic: for example, they are easily fooled by artificial systems generated by non-uniform memoryless random processes. But the deeper and more important point is that in order to test any statistical method that purports to distinguish writing from non-writing, one needs a set of corpora of clearly non-linguistic symbol systems. Few such corpora exist. The project reported in this paper fills that void by developing electronic corpora of known non-linguistic systems.
To date we have developed corpora of the following systems: European heraldry; totem poles; Mesopotamian deity symbols (kudurrus) (Seidl, 1989); Vinča symbols (Winn, 1981); Pictish symbols; mathematical equations downloaded from arXiv.org; weather icon sequences from 5-day forecasts downloaded from wunderground.com; and Pennsylvania German barn stars (also known as “hex signs”) (Graves, 1984). Corpus sizes range from several hundred to several tens of thousands of symbols. All corpora are encoded using an XML markup scheme based in part on the Text Encoding Initiative (tei-c.org) conventions. The corpora will be released under an open-source license via the Linguistic Data Consortium.
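
As a rough illustration of the measure the abstract refers to, bigram conditional entropy can be estimated from a symbol sequence as H(Y|X) = -Σ p(x, y) log2 p(y | x), summed over adjacent symbol pairs. The sketch below is a plain maximum-likelihood estimate written for this page; it is not the procedure of Rao et al. (2009) or Lee et al. (2010), and the function names, symbol inventory and weights are hypothetical. It also draws a sequence from a non-uniform memoryless process, the kind of artificial system the abstract says such measures cannot tell apart from writing.

```python
import math
import random
from collections import Counter


def bigram_conditional_entropy(seq):
    """Maximum-likelihood estimate of H(Y|X) in bits over adjacent pairs."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)                   # counts of (x, y)
    first_counts = Counter(x for x, _ in pairs)    # counts of x as first element
    total = len(pairs)
    h = 0.0
    for (x, _y), c in pair_counts.items():
        # p(x, y) = c / total ; p(y | x) = c / count(x)
        h -= (c / total) * math.log2(c / first_counts[x])
    return h


def memoryless_sequence(symbols, weights, length, seed=0):
    """Draw symbols independently with fixed, non-uniform probabilities.

    The symbols, weights and seed are illustrative assumptions only.
    """
    rng = random.Random(seed)
    return rng.choices(symbols, weights=weights, k=length)


if __name__ == "__main__":
    # A strictly alternating sequence is fully predictable from the previous symbol.
    print(bigram_conditional_entropy("abababab"))
    # A memoryless source: the previous symbol carries no information about the next,
    # so the estimate approaches the unigram entropy of the weights.
    sample = memoryless_sequence(["A", "B", "C"], [0.7, 0.2, 0.1], 20000)
    print(bigram_conditional_entropy(sample))
```

For a memoryless source the estimate simply tracks the unigram entropy of the chosen weights, so skewing the weights moves the value anywhere between zero and the maximum for the inventory size without any linguistic structure being involved.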

Similar resources

A Cross-linguistic and Cross-cultural Study of Epistemic Modality Markers in Linguistics Research Articles

Epistemic modality devices are believed to be one of the prominent characteristics of research articles as the commonly used genre among the academic community members. Considering the importance of such devices in producing and comprehending scientific discourse, this study aimed to cross-culturally and cross-linguistically investigate epistemic modality markers as an important subcategory...


A statistical comparison of written language and nonlinguistic symbol systems

Are statistical methods useful in distinguishing written language from nonlinguistic symbol systems? Some recent articles (Rao et al. 2009a, Lee et al. 2010a) have claimed so. Both of these previous articles use measures based at least in part on bigram conditional entropy, and subsequent work by one of the authors (Rao) has used other entropic measures. In both cases the authors have argued th...


Symbolic Learning Strategy based on Symbol Literacy Improvement

The meaning of symbol is open, based on which the learner can gradually establish his or her own symbol library and develop the capability of meaning construction. Relying on "linguistic symbol" and "non-linguistic symbol" -based symbol literacy improvement, medium literacy and cultural literacy education and training, multiple intelligence education, course ontology research, etc., we can impr...


Discrimination of Linguistic and Non-Linguistic Vocalizations in Spontaneous Speech: Intra- and Inter-Corpus Perspectives

We present a large-scale study on classification of linguistic and non-linguistic vocalizations including laughter, vocal noise, hesitation and consent on four corpora amounting to 46 h of spontaneous conversational speech. We consider training and testing on speaker-independent subsets of single corpora (intracorpus) as well as inter-corpus experiments where models built on one or more corpora...


KoGra-DB: Using MapReduce for Language Corpora

Linguistic query systems are special purpose IR applications. We present a novel state-of-the-art approach for the efficient exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer linguistic annotations an...


Publication date: 2012